📊 Bioinformatics Data Visualization

In this notebook, we'll demonstrate how to visualize different types of bioinformatics data:

  • Sequence-level properties like length and GC content
  • Simulated gene expression with a volcano plot
  • 3D molecular structures using py3Dmol

We'll use real E. coli coding sequences from NCBI and the hemoglobin structure 1A3N from the Protein Data Bank (PDB). The expression data is simulated.

1️⃣ Sequence Feature Visualization

📥 Download E. coli CDS Sequences

We’ll fetch the coding sequences (CDS) of E. coli K-12 MG1655 (assembly GCF_000005845.2) from the NCBI FTP server.

In [1]:
import urllib.request
url = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_cds_from_genomic.fna.gz"
urllib.request.urlretrieve(url, "ecoli_cds.fna.gz")
print("✅ Downloaded E. coli CDS FASTA")
✅ Downloaded E. coli CDS FASTA
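
Re-running the download cell fetches the file again every time. A minimal sketch of a caching wrapper is shown below; `download_if_missing` is a hypothetical helper name, not part of the original notebook, and it assumes a non-empty local file is a complete download.

```python
import os
import urllib.request


def download_if_missing(url, dest):
    """Download `url` to `dest` unless a non-empty file already exists there.

    Returns True if a download happened, False if the cached file was reused.
    (Hypothetical helper for illustration; it does not verify file integrity.)
    """
    if os.path.exists(dest) and os.path.getsize(dest) > 0:
        return False
    urllib.request.urlretrieve(url, dest)
    return True
```

Calling `download_if_missing(url, "ecoli_cds.fna.gz")` in place of `urllib.request.urlretrieve(...)` would skip the network request on later runs.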

📊 Calculate Sequence Lengths and GC Content

Let’s parse the gzipped FASTA file and record each sequence's length and GC content (as a percentage).

In [2]:
from Bio import SeqIO
from Bio.SeqUtils import gc_fraction
import gzip

seq_lengths = []
gc_contents = []

with gzip.open("ecoli_cds.fna.gz", "rt") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        seq_lengths.append(len(record.seq))
        gc_contents.append(gc_fraction(record.seq) * 100)

print(f"Parsed {len(seq_lengths)} sequences.")
Parsed 4315 sequences.
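
Before plotting, it can help to look at a few summary numbers. The sketch below uses only the standard library; `summarize` is a hypothetical helper introduced here for illustration, and it would be applied to the `seq_lengths` and `gc_contents` lists built above.

```python
from statistics import mean, median


def summarize(values):
    """Return count, min, median, mean and max of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    return {
        "n": len(values),
        "min": min(values),
        "median": median(values),
        "mean": mean(values),
        "max": max(values),
    }
```

For example, `summarize(seq_lengths)` gives a quick sanity check on the parsed data before any figure is drawn.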

📈 Plot: Length Distribution

Histogram of sequence lengths.

In [3]:
import matplotlib.pyplot as plt
plt.hist(seq_lengths, bins=50, color='lightblue')
plt.title("CDS Length Distribution")
plt.xlabel("Length (bp)")
plt.ylabel("Frequency")
plt.show()
[Figure: histogram titled "CDS Length Distribution"; x-axis: Length (bp), y-axis: Frequency]

🌐 Plot: GC Content vs Length

Explore how GC% varies with sequence length.

In [4]:
plt.scatter(seq_lengths, gc_contents, alpha=0.5)
plt.title("GC Content vs CDS Length")
plt.xlabel("Length (bp)")
plt.ylabel("GC Content (%)")
plt.show()
[Figure: scatter plot titled "GC Content vs CDS Length"; x-axis: Length (bp), y-axis: GC Content (%)]
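
A scatter plot shows the trend visually; a single correlation coefficient can complement it. The following is a minimal sketch using NumPy's `corrcoef`. The helper name `pearson_r` is an assumption made here, and Pearson's r only captures linear association.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
    # entry is the correlation between x and y.
    return float(np.corrcoef(x, y)[0, 1])
```

With the lists from the parsing cell, `pearson_r(seq_lengths, gc_contents)` would quantify how strongly GC content tracks CDS length.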

2️⃣ Gene Expression Volcano Plot (Simulated)

🧪 Simulate Expression Data for Volcano Plot

Create a mock expression dataset to visualize differential expression.

In [5]:
import pandas as pd
import numpy as np
import plotly.express as px

np.random.seed(42)
df = pd.DataFrame({
    'log2FC': np.random.normal(0, 2, 500),
    'pval': np.random.uniform(0, 1, 500)
})
df['-log10(pval)'] = -np.log10(df['pval'])
df['Significant'] = (abs(df['log2FC']) > 1) & (df['pval'] < 0.05)

🌋 Plot: Simulated Volcano Plot

A scatter plot of log2 fold change (x-axis) against -log10 p-value (y-axis), colored by whether a point passes both cutoffs (|log2FC| > 1 and p < 0.05).

In [6]:
fig = px.scatter(df, x='log2FC', y='-log10(pval)', color='Significant', title="Simulated Volcano Plot")
fig.show()
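
The boolean `Significant` column does not distinguish up-regulated from down-regulated points. One possible refinement, sketched below under the same cutoffs, labels each row as Up, Down, or Not significant. The function name `classify_regulation` and the `Regulation` column name are hypothetical choices for this example.

```python
import numpy as np
import pandas as pd


def classify_regulation(df, fc_cutoff=1.0, p_cutoff=0.05):
    """Label rows as 'Up', 'Down' or 'Not significant'.

    Expects columns 'log2FC' and 'pval', as in the simulated DataFrame.
    """
    significant = df["pval"] < p_cutoff
    conditions = [
        significant & (df["log2FC"] > fc_cutoff),   # strong increase
        significant & (df["log2FC"] < -fc_cutoff),  # strong decrease
    ]
    labels = np.select(conditions, ["Up", "Down"], default="Not significant")
    return pd.Series(labels, index=df.index, name="Regulation")
```

Assigning `df['Regulation'] = classify_regulation(df)` and passing `color='Regulation'` to `px.scatter` would then color the two arms of the volcano separately.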

3️⃣ Protein Structure: 3D Visualization with py3Dmol

📦 Download Hemoglobin Protein Structure (1A3N)

We’ll use this for 3D molecular visualization.

In [7]:
import urllib.request
url = "https://files.rcsb.org/download/1A3N.pdb"
urllib.request.urlretrieve(url, "1A3N.pdb")
print("✅ Downloaded 1A3N.pdb")
✅ Downloaded 1A3N.pdb

In [8]:
import warnings
from Bio import BiopythonWarning
from Bio.PDB import PDBParser

# Suppress Biopython warnings
warnings.simplefilter('ignore', BiopythonWarning)

# Load and parse the structure
parser = PDBParser()
structure = parser.get_structure("Hemoglobin", "1A3N.pdb")

# Print chain IDs
print("Chains in structure:")
for model in structure:
    for chain in model:
        print(" - Chain ID:", chain.id)
Chains in structure:
 - Chain ID: A
 - Chain ID: B
 - Chain ID: C
 - Chain ID: D

🧬 Interactive 3D Viewer Setup with py3Dmol

Let’s visualize the protein in 3D using a cartoon model.

In [9]:
import py3Dmol

view = py3Dmol.view(query='pdb:1A3N')
view.setStyle({'cartoon': {'color': 'spectrum'}})
view.zoomTo()
view.show()

🔬 Tip: Annotating Protein Regions (Optional)

To highlight active sites, ligands, or domains, you can add selections like:

view.addStyle({'chain': 'A', 'resn': 'HEM'}, {'stick': {}})

You can also add labels and spheres to residues for educational demos or reports.
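
As a rough sketch of the label-and-sphere idea, the helper below draws one residue as a sphere and attaches a text label to it. `highlight_residue` is a hypothetical name, the colors and sizes are arbitrary, and residue 87 of chain A in the usage comment is chosen purely for illustration. It relies on the 3Dmol.js viewer methods `addStyle(sel, style)` and `addLabel(text, options, sel)`, which py3Dmol forwards to the JavaScript viewer.

```python
def highlight_residue(view, chain, resi, label_text):
    """Show one residue as a sphere and label it on a py3Dmol-style view.

    `view` is expected to provide addStyle(sel, style) and
    addLabel(text, options, sel), as a py3Dmol.view object does.
    Returns the atom selection that was used.
    """
    selection = {"chain": chain, "resi": resi}
    # Draw the selected residue as orange spheres on top of existing styles.
    view.addStyle(selection, {"sphere": {"radius": 1.0, "color": "orange"}})
    # Place a text label at the selected residue.
    view.addLabel(
        label_text,
        {"fontSize": 12, "fontColor": "white", "backgroundColor": "black"},
        selection,
    )
    return selection


# Example usage in a notebook (assumes py3Dmol is installed):
#   import py3Dmol
#   view = py3Dmol.view(query='pdb:1A3N')
#   view.setStyle({'cartoon': {'color': 'lightgrey'}})
#   highlight_residue(view, 'A', 87, 'Residue 87')
#   view.zoomTo()
#   view.show()
```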

In [10]:
import py3Dmol

view = py3Dmol.view(query='pdb:1A3N')
view.addStyle({'chain': 'A', 'resn': 'HEM'}, {'stick': {}})
view.zoomTo()
view.show()

🧬 Interactive Protein Viewer with Chain and Style Selection

This viewer lets you:

  • Select a chain (A, B, C, or D)
  • Choose a visual style (cartoon, stick, or surface)

In [11]:
import py3Dmol
import ipywidgets as widgets
from IPython.display import display

# Define available options
chains = ['A', 'B', 'C', 'D']
styles = ['cartoon', 'stick', 'surface']

# Create widgets
chain_selector = widgets.Dropdown(
    options=chains,
    value='A',
    description='Chain:',
    style={'description_width': 'initial'}
)

style_selector = widgets.Dropdown(
    options=styles,
    value='cartoon',
    description='Style:',
    style={'description_width': 'initial'}
)

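# Function to update viewer
def update_viewer(chain_id, style):
    view = py3Dmol.view(query='pdb:1A3N')
    # Draw the whole structure as a grey cartoon for context
    view.setStyle({'cartoon': {'color': 'lightgrey'}})
    if style == 'surface':
        # A surface is not an atom style in 3Dmol.js, so it is added with addSurface
        view.addSurface(py3Dmol.VDW, {'color': 'red', 'opacity': 0.8}, {'chain': chain_id})
    else:
        # Highlight the selected chain in red using the chosen atom style
        view.addStyle({'chain': chain_id}, {style: {'color': 'red'}})
    view.zoomTo()
    view.show()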

# Display widgets together
ui = widgets.HBox([chain_selector, style_selector])
out = widgets.interactive_output(update_viewer, {'chain_id': chain_selector, 'style': style_selector})

display(ui, out)